Executive Summary

SwiftKey is a text messaging application that predicts the next word as you type a message. When the user types "I went to the ", the application presents three options for what the next word might be. For example, the three words might be gym, store, and restaurant.

In this project, we use R text mining tools to build an application that predicts the next word entered into a Shiny application. We use data scraped from blogs, Twitter, and news sites to build a language model, and we predict the next word by calculating the maximum likelihood estimate (MLE) from that model.

TL;DR

In a nutshell, here’s a summary of the data analysis performed in this report.

  1. Load the raw data.
  2. Extract a 1% subsample of the data.
  3. Preprocess the data to remove stopwords, convert to lowercase, and perform other cleaning tasks.
  4. Generate 2-grams, 3-grams and 4-grams.
  5. Build a language model where an ngram (i.e., a phrase) is a path to a search node in a tree.
  6. Calculate the maximum likelihood estimate (MLE) using the children of the search node.

Complete Dataset

The Data Science Capstone Course provided text data from three sources: blogs, Twitter, and news. The table below shows the number of lines from each data source.

  filename            line_count
1 en_US.blogs.txt         899288
2 en_US.news.txt         1010242
3 en_US.twitter.txt      2360148
4 Total                  4269678

Transforming Unstructured Text for Analysis

We use a framework for text mining applications within R to transform the unstructured text into the structured document-by-term matrix format required for analysis. The first step is to "clean" the text with a series of preprocessing functions, such as lowercasing, stopword removal, and punctuation removal.
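As a hedged sketch of what this cleaning step might look like (assuming the tm package; the actual preprocess_entries() helper used later in this report may differ):

```r
library(tm)

# Hypothetical preprocessing pipeline sketch using tm transformations
preprocess_entries <- function(docs) {
  docs <- tm_map(docs, content_transformer(tolower))       # lowercase everything
  docs <- tm_map(docs, removeNumbers)                      # drop digits
  docs <- tm_map(docs, removePunctuation)                  # drop punctuation
  docs <- tm_map(docs, removeWords, stopwords("english"))  # drop stopwords
  tm_map(docs, stripWhitespace)                            # collapse extra spaces
}

docs <- VCorpus(VectorSource("The Data IS 123 great!"))
docs <- preprocess_entries(docs)
content(docs[[1]])
```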

Second, we tokenize the text into ngrams. For our analysis, a gram is a whitespace-delimited sequence of characters corresponding to an English word, and the N in ngram is the number of consecutive words treated as a single unit. For example, "new" is a 1-gram, "new york" is a 2-gram, "new york city" is a 3-gram, and "new york city police" is a 4-gram.
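To make the sliding-window idea concrete, here is a minimal base-R illustration (the make_ngrams name is hypothetical; the report's actual tokenization happens inside its own helpers):

```r
# Build ngrams by sliding a window of n consecutive words
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

make_ngrams(c("new", "york", "city", "police"), 2)
# "new york"   "york city"   "city police"
make_ngrams(c("new", "york", "city", "police"), 4)
# "new york city police"
```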

Finally, we build a document-by-term matrix by transforming the ngrams into a bag-of-words model. Each column of the matrix represents a unique ngram, each row represents a document, and each cell holds the frequency with which that ngram appears in that document.
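A small illustration of the bag-of-words representation (assuming the tm package with its default word tokenizer; the two documents here are toy examples):

```r
library(tm)

# Two tiny documents; the matrix counts each term per document
docs <- VCorpus(VectorSource(c("new york city", "new york state")))
dtm <- DocumentTermMatrix(docs)
inspect(dtm)
# rows = documents, columns = terms, cells = term frequencies
```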

Sampling the Dataset

The 4,269,678 lines in the complete dataset can be memory intensive for the text mining tools and slow the analysis. To speed things up, we subsample 1% of the complete dataset and then work with the subsampled data for exploration and modeling. The subsampling implementation is in Appendix 1.

source   num_lines num_unique_words mean_word_freq median_word_freq
twitter      23602             8040             20                9
blogs         8993            12414             15                6
news         10103            12850             15                7

The table above provides summary statistics for the sampled data. We see that the Twitter data has fewer unique words, which makes sense given the 140-character limit. We also see that the mean word frequency is roughly twice the median frequency, indicating a right-skewed distribution in which a small number of words appear very often.

Word Frequencies

As an example, look at the word frequency distribution for the sampled data:

p <- all_docs_word_plot(sample_vector_corpus)
print(p)

Ngram Frequencies

Let’s load all the data sources into a single corpus.

docs <- load_sample_dircorpus()
docs <- preprocess_entries(docs)

Here are the top bigrams.

ngram_2 <- get_docterm_matrix(docs, 2)
p2 <- generate_word_frequency_plot(ngram_2$wf, "Top Bigrams for Sampled Text")
print(p2)

Here are the top trigrams.

ngram_3 <- get_docterm_matrix(docs, 3)
p3 <- generate_word_frequency_plot(ngram_3$wf, "Top Trigrams for Sampled Text")
print(p3)

Here are the top 4-grams.

ngram_4 <- get_docterm_matrix(docs, 4)
p4 <- generate_word_frequency_plot(ngram_4$wf, "Top 4-grams for Sampled Text")
print(p4)

Word Prediction Using Ngrams

We build a tree using the ngrams and compute the maximum likelihood estimates using the Dirichlet-multinomial model. We use the data.tree package, which can build a tree from a data.frame. Now let’s perform a search for “data”.
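To make the tree-plus-MLE idea concrete, here is a minimal sketch with the data.tree package; the paths and counts below are illustrative stand-ins, not the report's full ngram table:

```r
library(data.tree)

# Each ngram is a path in the tree, with its observed frequency on the node
ngram_df <- data.frame(
  pathString = c("root/data",
                 "root/data/entry",
                 "root/data/streams"),
  freq = c(198, 12, 10)
)
ngram_tree <- as.Node(ngram_df)

# MLE for the next word after "data": child frequency / parent frequency
data_node <- ngram_tree$children$data
sapply(data_node$children, function(child) child$freq / data_node$freq)
#      entry    streams
# 0.06060606 0.05050505
```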

Word Prediction for: ‘data’

Here are the maximum likelihood estimates. They show a 6% likelihood that “entry” will be the next word: “data entry” has a frequency of 12 and “data” has a frequency of 198, so the maximum likelihood estimate is 12/198 ≈ 6.1%.

results <- perform_search(ngram_tree, c("data"))
print(results)
##                   12                   10                  
## recommended_words "entry"              "streams"           
## likelihood        "0.0606060606060606" "0.0505050505050505"
##                   8                    7                   
## recommended_words "recovery"           "dating"            
## likelihood        "0.0404040404040404" "0.0353535353535354"
##                   7                   
## recommended_words "personalize"       
## likelihood        "0.0353535353535354"

Word Prediction for: ‘data entry’

Then, if we query for “data entry”, we traverse the tree through the nodes “data” and “entry”, and we recommend the words “just” and “respond”.

results <- perform_search(ngram_tree, c("data", "entry"))
print(results)
##                   6      6        
## recommended_words "just" "respond"
## likelihood        "0.5"  "0.5"

Next Steps

  • Build a model using more than a 1% sample.
  • Deploy the ngram tree to the server side of a Shiny application.

Appendix

Appendix 1: Subsampling Code

This code collects an approximately 1% sample, using a “coin flip” for each line to decide whether to keep it.

# sample the datasci dir
sample_capstone_data <- function(fn, outfn, sample_len=0.01) {
  print(sprintf("Reading %s", fn))
  lines <- readLines(fn)
  set.seed(123)
  print(sprintf("Read %s. Length %s", fn, length(lines)))
  # Flip a biased coin for each line: keep it with probability sample_len
  keep <- rbinom(length(lines), size=1, prob=sample_len) == 1
  lines_sample <- lines[keep]
  print(sprintf("Writing %s. Length %s", outfn, length(lines_sample)))
  writeLines(lines_sample, con=outfn)
}

sample_capstone_data("./data/final/en_US/en_US.twitter.txt",
                     "./data/final/en_US/sample/en_US.twitter.txt")
sample_capstone_data("./data/final/en_US/en_US.blogs.txt",
                     "./data/final/en_US/sample/en_US.blogs.txt")
sample_capstone_data("./data/final/en_US/en_US.news.txt",
                     "./data/final/en_US/sample/en_US.news.txt")